Automatic Linguistic Annotation of Large Scale L2 Databases: The EF-Cambridge Open Language Database (EFCamDat)

نویسندگان

  • Jeroen Geertzen
  • Theodora Alexopoulou
  • Anna Korhonen
چکیده

∗Naturalistic learner productions are an important empirical resource for SLA research. Some pioneering works have produced valuable second language (L2) resources supporting SLA research.1 One common limitation of these resources is the absence of individual longitudinal data for numerous speakers with different backgrounds across the proficiency spectrum, which is vital for understanding the nature of individual variation in longitudinal development.2 A second limitation is the relatively restricted amounts of data annotated with linguistic information (e.g., lexical, morphosyntactic, semantic features, etc.) to support investigation of SLA hypotheses and obtain patterns of development for different linguistic phenomena. Where available, annotations tend to be manually obtained, a situation posing immediate limitations to the quantity of data that could be annotated with reasonable human resources and within reasonable time. Natural Language Processing (NLP) tools can provide automatic annotations for parts-of-speech (POS) and syntactic structure and are indeed increasingly applied to learner language in various contexts. Systems in computer-assisted language learning (CALL) have used a parser and other NLP tools to automatically detect learner errors and provide feedback accordingly.3 Some work aimed at adapting annotations provided by parsing tools to accurately describe learner syntax (Dickinson & Lee, 2009) or evaluated parser performance on learner language and the effect of learner errors on the parser. Krivanek and Meurers (2011) compared two parsing methods, one using a hand-crafted lexicon and one trained on a corpus. They found that the former is more successful in recovering the main grammatical dependency relations whereas the latter is more successful in recovering optional, adjunction relations. Ott and Ziai (2010) evaluated the performance of a dependency parser trained on native German (MaltParser; Nivre et al., 2007) on 106 learner answers to a comprehension task in L2 German. Their study indicates that while some errors can be problematic for the parser (e.g., omission of finite verbs) many others (e.g., wrong word order) can be parsed robustly, resulting in overall high performance scores. In this paper we have two goals. First, we introduce a new English L2 database, the EF Cambridge Open Language Database, henceforth EFCAMDAT. EFCAMDAT was developed by the Department of Theoretical and Applied Linguistics at the University of Cambridge in collaboration with EF Education First, an international educational organization. It contains writings submitted to Englishtown, the

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Longitudinal L2 Development of the English Article in Individual Learners

We investigate the accuracy development of the English article by learners of English as a second language. The study focuses on individual learners, tracking their learning trajectories through their writings in the EF-Cambridge Open Language Database (EFCAMDAT), an open access learner corpus. We draw from 17,859 writings by 1,280 learners and ask whether article accuracy in individual learner...

متن کامل

Native Language Identification Using Large, Longitudinal Data

Native Language Identification (NLI) is a task aimed at determining the native language (L1) of learners of second language (L2) on the basis of their written texts. To date, research on NLI has focused on relatively small corpora. We apply NLI to the recently released EFCamDat corpus which is not only multiple times larger than previous L2 corpora but also provides longitudinal data at several...

متن کامل

Functional Information in SWISS-PROT: the Basis for Large-scale Characterisation of Protein Sequences

With the rapid growth of sequence databases, there is an increasing need for reliable functional characterisation and annotation of newly predicted proteins. To cope with such large data volumes, faster and more effective means of protein sequence characterisation and annotation are required. One promising approach is automatic large-scale functional characterisation and annotation, which is ge...

متن کامل

Linguistic Annotation of Two Prosodic Databases

Two prosodic databases were annotated with linguistic information using SGML (Standard General Markup Language), one database of American English and one of Modern Standard German. Only information that might have prosodic correlates was annotated. Pho-netic and morphological information was supplied by automatic tools and then hand corrected. Semantic and pragmatic information was inserted by ...

متن کامل

The Effects of Multimedia Annotations on Iranian EFL Learners’ L2 Vocabulary Learning

In our modern technological world, Computer-Assisted Language learning (CALL) is a new realm towards learning a language in general, and learning L2 vocabulary in particular. It is assumed that the use of multimedia annotations promotes language learners’ vocabulary acquisition. Therefore, this study set out to investigate the effects of different multimedia annotations (still picture annotatio...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013